Single-cell RNA-seq data are simulated to represent a situation in which two groups of cells generated through some experimental procedure are found to have heterogeneous expression in a number of genes. Both groups also possess genes that are differentially expressed relative to a group of control cells.
We will show that the two groups of cells subjected to the experimental procedure are indistinguishable under dimension reduction techniques that do not take into account the information stored in the control cells.
# load the packages used by the simulation
library(splatter)
library(scater)
library(matrixStats)

# simulate the three groups of cells such that cell heterogeneity is masked by
# some batch effect
params <- newSplatParams(
  seed = 6757293,
  nGenes = 500,
  batchCells = c(150, 150),
  batch.facLoc = c(0.05, 0.05),
  batch.facScale = c(0.05, 0.05),
  group.prob = rep(1/3, 3),
  de.prob = c(0.1, 0.05, 0.1),
  de.downProb = c(0.1, 0.05, 0.1),
  de.facLoc = rep(0.2, 3),
  de.facScale = rep(0.2, 3)
)
sim_groups_sce <- splatSimulate(params, method = "groups")

# get the logcounts of the data (logNormCounts() replaces the normalize()
# call used with older scater releases)
sim_groups_sce <- logNormCounts(sim_groups_sce)

# remove all genes (rows) without variation in counts
sim_groups_sce <- sim_groups_sce[rowVars(counts(sim_groups_sce)) != 0, ]
We take the first two principal components of the entire dataset to illustrate that the variance caused by the batch effect dominates all other signals in the data.
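As a rough illustration of this point, the sketch below uses base R and a small toy matrix rather than the Splatter simulation. The matrix sizes, the effect magnitudes, and the object names (`toy`, `batch`, `group`) are assumptions made for this example only. It compares how far apart the batches and the biological groups sit along the first principal component.

```r
# Toy illustration (not the vignette's data): a large batch shift and a
# small group shift, followed by ordinary PCA.
set.seed(1)
n_cells <- 120
n_genes <- 50
batch <- rep(c(1, 2), each = n_cells / 2)   # hypothetical batch labels
group <- rep(c(1, 2), times = n_cells / 2)  # hypothetical biological groups

toy <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)
toy[batch == 2, ] <- toy[batch == 2, ] + 3          # batch shifts every gene
toy[group == 2, 1:5] <- toy[group == 2, 1:5] + 0.5  # group shifts a few genes

pca <- prcomp(toy, center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]

# Separation along PC1, measured as the absolute difference in mean scores
batch_gap <- abs(mean(pc1[batch == 1]) - mean(pc1[batch == 2]))
group_gap <- abs(mean(pc1[group == 1]) - mean(pc1[group == 2]))
c(batch_gap = batch_gap, group_gap = group_gap)
```

In this toy setting the batch gap is much larger than the group gap, which is the pattern that a plot of the first two principal components, colored by batch, would be expected to show.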
Now, we focus on applying various dimension reduction techniques to the target data, i.e. the cells that were subjected to some experimental procedure. The transcriptome data belonging to the control cells are used as a background dataset for cPCA and scPCA.
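To make the role of the background dataset concrete, here is a minimal sketch of the idea behind cPCA in base R: the leading eigenvectors of the target covariance minus a scaled background covariance. It is not the scPCA package's implementation, it omits the sparsity penalty that scPCA adds, and the toy data, the fixed `contrast` value, and all object names are assumptions chosen for illustration.

```r
# Minimal cPCA sketch in base R (an illustration, not the scPCA package's code)
set.seed(2)
n <- 200
p <- 20

# Nuisance variation along gene 1, present in both target and background
nuisance <- rep(c(-5, 5), times = n / 2)
# Signal of interest along gene 2, present only in the target data
signal <- rep(c(-3, 3), each = n / 2)

target <- matrix(rnorm(n * p), nrow = n, ncol = p)
target[, 1] <- target[, 1] + nuisance
target[, 2] <- target[, 2] + signal

background <- matrix(rnorm(n * p), nrow = n, ncol = p)
background[, 1] <- background[, 1] + nuisance

# Hypothetical contrastive parameter; in practice it is tuned, not fixed
contrast <- 1
contrastive_cov <- cov(target) - contrast * cov(background)
eig <- eigen(contrastive_cov, symmetric = TRUE)

v_cpca <- eig$vectors[, 1]             # leading contrastive direction
v_pca <- prcomp(target)$rotation[, 1]  # leading ordinary principal component

# Gene with the largest absolute loading under each method
top_gene_pca <- which.max(abs(v_pca))
top_gene_cpca <- which.max(abs(v_cpca))
c(pca = top_gene_pca, cpca = top_gene_cpca)
```

Ordinary PCA is drawn to the gene carrying the shared nuisance variation, whereas subtracting the background covariance cancels that variation and leaves the target-specific signal as the leading direction.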
| Gene | DEFacGroup1 | DEFacGroup3 | diff | scPCA1 |
|---|---|---|---|---|
| Gene327 | 1.0000000 | 2.2944560 | 1.2944560 | 1 |
| Gene201 | 2.0011872 | 1.0000000 | 1.0011872 | 0 |
| Gene346 | 1.0000000 | 1.9037754 | 0.9037754 | 1 |
| Gene240 | 1.0000000 | 1.7944498 | 0.7944498 | 0 |
| Gene128 | 1.7802675 | 1.0000000 | 0.7802675 | 1 |
| Gene473 | 1.0000000 | 1.7703224 | 0.7703224 | 0 |
| Gene214 | 1.6983572 | 1.0000000 | 0.6983572 | 0 |
| Gene188 | 1.0000000 | 1.6860079 | 0.6860079 | 1 |
| Gene307 | 1.6467488 | 1.0000000 | 0.6467488 | 1 |
| Gene192 | 1.6426308 | 1.0000000 | 0.6426308 | 1 |
| Gene44 | 1.6113054 | 1.0000000 | 0.6113054 | 1 |
| Gene454 | 1.0000000 | 1.5922447 | 0.5922447 | 1 |
| Gene270 | 1.0000000 | 1.5701455 | 0.5701455 | 0 |
| Gene304 | 1.0000000 | 1.5430068 | 0.5430068 | 0 |
| Gene190 | 1.0000000 | 1.5176864 | 0.5176864 | 1 |
| Gene383 | 0.4846386 | 1.0000000 | 0.5153614 | 1 |
| Gene158 | 1.5126046 | 1.0000000 | 0.5126046 | 1 |
| Gene8 | 1.5483328 | 1.0770052 | 0.4713276 | 1 |
| Gene370 | 1.4712158 | 1.0000000 | 0.4712158 | 1 |
| Gene364 | 1.4459099 | 1.0000000 | 0.4459099 | 1 |
| Gene66 | 1.0000000 | 1.4211804 | 0.4211804 | 1 |
| Gene68 | 1.0000000 | 1.3941840 | 0.3941840 | 1 |
| Gene10 | 1.3941636 | 1.0000000 | 0.3941636 | 0 |
| Gene54 | 1.0000000 | 1.3759716 | 0.3759716 | 1 |
| Gene315 | 1.0000000 | 1.3691284 | 0.3691284 | 0 |
| Gene3 | 1.3611369 | 1.0000000 | 0.3611369 | 0 |
| Gene135 | 1.3587745 | 1.0000000 | 0.3587745 | 0 |
| Gene334 | 1.0000000 | 0.6421265 | 0.3578735 | 0 |
| Gene196 | 1.3468610 | 1.0000000 | 0.3468610 | 1 |
| Gene245 | 1.0000000 | 1.3271180 | 0.3271180 | 0 |
| Gene220 | 1.3079478 | 1.0000000 | 0.3079478 | 0 |
| Gene342 | 1.0000000 | 1.2988780 | 0.2988780 | 0 |
| Gene380 | 1.2972411 | 1.0000000 | 0.2972411 | 0 |
| Gene228 | 1.2931549 | 1.0000000 | 0.2931549 | 0 |
| Gene363 | 0.7128209 | 1.0000000 | 0.2871791 | 0 |
| Gene229 | 1.2861742 | 1.0000000 | 0.2861742 | 0 |
| Gene80 | 0.7147228 | 1.0000000 | 0.2852772 | 0 |
| Gene100 | 1.0000000 | 1.2837038 | 0.2837038 | 0 |
| Gene338 | 1.2741687 | 1.0000000 | 0.2741687 | 0 |
| Gene275 | 1.0000000 | 1.2706184 | 0.2706184 | 0 |
| Gene108 | 1.0000000 | 1.2704223 | 0.2704223 | 0 |
| Gene436 | 0.7300889 | 1.0000000 | 0.2699111 | 0 |
| Gene143 | 1.2692307 | 1.0000000 | 0.2692307 | 0 |
| Gene254 | 1.2596478 | 1.0000000 | 0.2596478 | 0 |
| Gene353 | 1.0000000 | 1.2574056 | 0.2574056 | 0 |
| Gene489 | 1.0000000 | 1.2499518 | 0.2499518 | 0 |
| Gene285 | 1.2458224 | 1.0000000 | 0.2458224 | 1 |
| Gene103 | 1.2402441 | 1.0000000 | 0.2402441 | 0 |
| Gene482 | 1.0000000 | 1.2327516 | 0.2327516 | 0 |
| Gene258 | 1.0000000 | 0.7857538 | 0.2142462 | 0 |
| Gene218 | 1.0000000 | 1.2088149 | 0.2088149 | 0 |
| Gene458 | 1.0000000 | 1.1934206 | 0.1934206 | 0 |
| Gene235 | 1.1802687 | 1.0000000 | 0.1802687 | 1 |
| Gene197 | 1.0000000 | 0.8232851 | 0.1767149 | 0 |
| Gene453 | 1.0000000 | 1.1731249 | 0.1731249 | 0 |
| Gene36 | 1.0000000 | 0.8368966 | 0.1631034 | 0 |
| Gene28 | 1.0000000 | 1.1586270 | 0.1586270 | 0 |
| Gene193 | 1.0000000 | 1.1564362 | 0.1564362 | 0 |
| Gene55 | 1.1393105 | 1.2937132 | 0.1544027 | 0 |
| Gene302 | 1.0000000 | 1.1483239 | 0.1483239 | 0 |
| Gene238 | 1.1441278 | 1.0000000 | 0.1441278 | 0 |
| Gene30 | 1.0000000 | 1.1425416 | 0.1425416 | 0 |
| Gene75 | 1.0000000 | 1.1370305 | 0.1370305 | 0 |
| Gene11 | 1.0000000 | 1.1324617 | 0.1324617 | 0 |
| Gene424 | 1.1310105 | 1.0000000 | 0.1310105 | 0 |
| Gene70 | 1.0000000 | 0.8755727 | 0.1244273 | 0 |
| Gene169 | 1.0000000 | 1.1230450 | 0.1230450 | 0 |
| Gene405 | 1.1723449 | 1.0548861 | 0.1174588 | 0 |
| Gene250 | 1.0000000 | 0.8854873 | 0.1145127 | 0 |
| Gene46 | 1.0000000 | 1.1050231 | 0.1050231 | 0 |
| Gene145 | 1.0000000 | 1.0937630 | 0.0937630 | 0 |
| Gene374 | 1.0918226 | 1.0000000 | 0.0918226 | 0 |
| Gene399 | 1.0000000 | 1.0807069 | 0.0807069 | 0 |
| Gene484 | 1.0773906 | 1.0000000 | 0.0773906 | 0 |
| Gene475 | 1.0760033 | 1.0000000 | 0.0760033 | 0 |
| Gene222 | 1.0737436 | 1.0000000 | 0.0737436 | 0 |
| Gene202 | 1.6427858 | 1.7156525 | 0.0728667 | 0 |
| Gene278 | 1.0726935 | 1.0000000 | 0.0726935 | 0 |
| Gene132 | 1.0000000 | 1.0715175 | 0.0715175 | 0 |
| Gene468 | 1.0000000 | 0.9316385 | 0.0683615 | 0 |
| Gene292 | 1.0627392 | 1.0000000 | 0.0627392 | 0 |
| Gene126 | 1.0621330 | 1.0000000 | 0.0621330 | 0 |
| Gene239 | 1.0000000 | 1.0619006 | 0.0619006 | 0 |
| Gene116 | 1.0000000 | 1.0596130 | 0.0596130 | 0 |
| Gene231 | 1.0000000 | 1.0546435 | 0.0546435 | 0 |
| Gene118 | 1.0474613 | 1.0000000 | 0.0474613 | 0 |
| Gene256 | 1.0406122 | 1.0000000 | 0.0406122 | 0 |
| Gene227 | 1.0000000 | 1.0400291 | 0.0400291 | 0 |
| Gene455 | 1.0373562 | 1.0000000 | 0.0373562 | 0 |
| Gene347 | 1.3211550 | 1.2865117 | 0.0346433 | 0 |
| Gene403 | 1.0000000 | 1.0307340 | 0.0307340 | 0 |
| Gene309 | 1.0181823 | 1.0000000 | 0.0181823 | 0 |
| Gene416 | 1.1885126 | 1.1734931 | 0.0150194 | 0 |
| Gene461 | 1.0000000 | 1.0108062 | 0.0108062 | 0 |
| Gene291 | 1.0089752 | 1.0000000 | 0.0089752 | 0 |
| Gene396 | 1.0000000 | 1.0071671 | 0.0071671 | 0 |
| Gene123 | 1.0041118 | 1.0000000 | 0.0041118 | 0 |
| Gene434 | 1.0000000 | 1.0029095 | 0.0029095 | 0 |
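The table lists the 98 differentially expressed genes, ordered by diff, the absolute difference between their DE factors in groups 1 and 3. The scPCA1 column marks the genes with non-zero entries in the first loading vector of scPCA's loading matrix. All 20 of these genes are differentially expressed, and they are concentrated among the genes with the largest differences: 17 of them fall within the top 24 rows of the table.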
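The table's columns can be checked with a few lines of base R. The sketch below copies five rows from the table and assumes that diff is the absolute difference between the two DE factors, which matches the rows shown. The data frame name `de_tab` is hypothetical.

```r
# Five rows copied from the table above
de_tab <- data.frame(
  Gene = c("Gene327", "Gene201", "Gene383", "Gene8", "Gene285"),
  DEFacGroup1 = c(1.0000000, 2.0011872, 0.4846386, 1.5483328, 1.2458224),
  DEFacGroup3 = c(2.2944560, 1.0000000, 1.0000000, 1.0770052, 1.0000000),
  scPCA1 = c(1, 0, 1, 1, 1)
)

# Assumed definition of diff: absolute difference between the DE factors
de_tab$diff <- abs(de_tab$DEFacGroup1 - de_tab$DEFacGroup3)

# Order by decreasing diff, as in the table
de_tab <- de_tab[order(de_tab$diff, decreasing = TRUE), ]

# Number of these genes with a non-zero entry in the first loading vector
n_selected <- sum(de_tab$scPCA1 != 0)
de_tab
```

Applied to the full table, the same count gives the 20 genes reported in the text.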
## Computing the multiple Kernels.
## Performing network diffiusion.
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## Iteration: 6
## Iteration: 7
## Iteration: 8
## Iteration: 9
## Iteration: 10
## Performing t-SNE.
## Epoch: Iteration # 100 error is: 0.1292803
## Epoch: Iteration # 200 error is: 0.1265534
## Epoch: Iteration # 300 error is: 0.125092
## Epoch: Iteration # 400 error is: 0.1248576
## Epoch: Iteration # 500 error is: 0.1248467
## Epoch: Iteration # 600 error is: 0.1248464
## Epoch: Iteration # 700 error is: 0.1248464
## Epoch: Iteration # 800 error is: 0.1248464
## Epoch: Iteration # 900 error is: 0.1248464
## Epoch: Iteration # 1000 error is: 0.1248464
## Performing Kmeans.
## Performing t-SNE.
## Epoch: Iteration # 100 error is: 9.929153
## Epoch: Iteration # 200 error is: 0.1860103
## Epoch: Iteration # 300 error is: 0.1371697
## Epoch: Iteration # 400 error is: 0.1351052
## Epoch: Iteration # 500 error is: 0.1327441
## Epoch: Iteration # 600 error is: 0.1301649
## Epoch: Iteration # 700 error is: 0.1275335
## Epoch: Iteration # 800 error is: 0.1255832
## Epoch: Iteration # 900 error is: 0.1247959
## Epoch: Iteration # 1000 error is: 0.1246944
Note: The SIMLR results are not included in the figure because their average silhouette widths are misleading. The deceptively high values are a product of SIMLR’s low-dimensional representation of the data, in which the biological clusters are compact and very far apart. The batch effect, however, is not removed.
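The following sketch shows how this can happen. It builds a toy two-dimensional embedding, which is not SIMLR output, in which two biological groups lie far apart while each group is still split by batch. The helper `avg_silhouette()` is a hypothetical base R implementation of the average silhouette width, written here only for illustration, and it assumes every cluster has at least two members.

```r
# Average silhouette width in base R (hypothetical helper for illustration)
avg_silhouette <- function(x, labels) {
  d <- as.matrix(dist(x))
  labs <- unique(labels)
  widths <- vapply(seq_len(nrow(d)), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE
    a <- mean(d[i, own])  # mean distance to the other members of own cluster
    b <- min(vapply(setdiff(labs, labels[i]), function(l) {
      mean(d[i, labels == l])  # mean distance to each other cluster
    }, numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# Toy embedding: groups far apart on axis 1, batches apart on axis 2
set.seed(3)
n_per <- 25
group <- rep(c("g1", "g2"), each = 2 * n_per)
batch <- rep(rep(c("b1", "b2"), each = n_per), times = 2)
embed <- cbind(
  ifelse(group == "g1", -10, 10) + rnorm(4 * n_per, sd = 0.3),
  ifelse(batch == "b1", -1, 1) + rnorm(4 * n_per, sd = 0.3)
)

# High width for the biological labels ...
sil_group <- avg_silhouette(embed, group)
# ... even though batches remain clearly separated within a group
in_g1 <- group == "g1"
sil_batch_in_g1 <- avg_silhouette(embed[in_g1, ], batch[in_g1])
c(group = sil_group, batch_within_g1 = sil_batch_in_g1)
```

A high average silhouette width for the biological labels therefore says little about whether the batch effect is gone, so it helps to inspect the batch labels within each cluster as well.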